Palantir Technologies – VAST10 Team
Brandon Wright, Palantir Technologies, bwright@palantirtech.com
Jesse Rickard, Palantir Technologies
Alex Polit, Palantir Technologies
Jason Payne, Palantir Technologies
Overview: Palantir Horizon is part of Palantir’s approach for big data: rapid, interactive analysis of datasets that contain billions of records. Project Horizon was developed as a Palantir “Hack Day” project on top of the Palantir platform; it empowers analysts to start with their entire ecosystem of data (literally billions of rows of data), and iteratively pare the data down to discover the proverbial needle in the haystack.
Horizon
http://www.palantirtech.com/horizon
Background: Palantir is operational today at many of the most prestigious intelligence, defense, law enforcement, and regulation/oversight organizations in the world. Palantir was put together by the founders of PayPal, capitalizing on the lessons learned by their anti-fraud department. Facing highly coordinated cyber attacks aimed at committing payment fraud and exploiting sensitive consumer information, PayPal needed an entirely new approach. Existing technology was poorly suited to dealing with sparse, cyber-specific data. To defeat the international fraud rings, high-level conceptual access to the data was required. The analyst-driven intelligence analysis tools that eventually became the Palantir platform were a direct outgrowth of this effort.
Company Web site:
http://www.palantirtech.com
Check out our Analysis Blog to see more analysis using Palantir: http://www.palantirtech.com/government/analysis-blog.
Video:
ANSWERS:
MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.
We used Palantir’s Horizon platform to analyze the spread of the disease. Our approach involved visualizing the key macro characteristics of the entire dataset of approximately 15 million hospitalization records, and iteratively deriving key subsets and properties in order to produce more granular visualizations.
In terms of time commitment, our team parsed all symptoms manually in Excel and Access, which took several hours. Importing the data into Palantir took several minutes. Actual workflows proceeded essentially as quickly as the analyst could think of ways to view the data. Each custom view took anywhere from 0.5 to 3 seconds to generate. Total analysis time was approximately 2.5 hours of a single analyst creating near-instantaneous views of the data to understand the behavior patterns of the outbreak.
Our analysis of Mini Challenge 2.1 comprised two main sections. We began by performing analysis of temporal patterns, which allowed us to isolate subsets of likely disease-related deaths vs. non-disease-related deaths among total hospitalizations. From there, we performed statistical analyses of the prevalence of various symptoms for each subset as well as the correlation of those symptoms with mortality rates.
Temporal Patterns:
We used the “date” property of the entire data set to plot the hospital admittances on a timeline. No clear pattern emerged, but we did find a general peak in hospitalizations around the middle of May. However, once we drilled down on hospitalizations resulting in death, the timeline revealed a strong pattern, with a peak in hospitalizations occurring on the 16th of May:
Figure MC2.1.1: Timeline of hospitalizations. The first timeline represents total hospitalizations by date, while the second represents only those hospitalizations resulting in death.
In order to isolate likely disease-related deaths, we wanted to determine how long it took patients to die after being hospitalized. We accomplished this by deriving a new property, “days til death”, which subtracts the hospitalization date from the death date and divides the result by 86400000 (the number of milliseconds in a day), according to the formula [(${death date} - $date) / 86400000].
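The derived property can be sketched outside Palantir as follows. This is a minimal Python illustration, assuming both dates are stored as epoch timestamps in milliseconds (which the divisor 86400000 suggests); the function names and the sample dates are hypothetical and are not taken from the challenge data.

```python
from datetime import datetime, timezone

MS_PER_DAY = 86_400_000  # 24 h * 60 min * 60 s * 1000 ms


def to_epoch_ms(year, month, day):
    """Convert a calendar date (midnight UTC) to epoch milliseconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


def days_til_death(date_ms, death_date_ms):
    """Days between hospitalization and death, mirroring
    (${death date} - $date) / 86400000."""
    return (death_date_ms - date_ms) / MS_PER_DAY


# Hypothetical patient: hospitalized 16 May 2009, died 24 May 2009.
admitted = to_epoch_ms(2009, 5, 16)
died = to_epoch_ms(2009, 5, 24)
print(days_til_death(admitted, died))
```

Using UTC midnight for both dates keeps the difference a whole number of days; timestamps carrying a time of day would yield fractional values that need rounding before they can be grouped.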
We then produced a property value histogram of the “days til death” property, which revealed that almost all deaths (96.9%) occurred 8 days after hospitalization:
Figure MC2.1.2: Property value histogram showing distribution of cases by “days til death” property.
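A property value histogram of this kind amounts to counting how often each value occurs and expressing each count as a share of the total. The sketch below is a plain Python approximation, not Horizon's implementation; the function name is hypothetical and the sample values are invented for illustration.

```python
from collections import Counter


def value_histogram(values):
    """Return {value: (count, share_of_total)}, most common value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.most_common()}


# Invented sample of "days til death" values (not the challenge data).
sample_days = [8] * 19 + [2]
hist = value_histogram(sample_days)
for value, (count, share) in hist.items():
    print(value, count, share)
```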
Statistical Patterns:
It should be acknowledged that it is difficult to distinguish hospitalizations related to the disease from those that are not. However, it is easier to distinguish disease-related deaths, since they occurred 8 days after hospitalization.
We generated property value histograms for the property “symptoms” for all hospitalizations; hospitalizations resulting in death from the disease (death after 8 days); and all other hospitalizations resulting in death. The key symptoms for the disease are Vomiting, Abdominal Pain, Nose Bleed, Back Pain and Diarrhea (Fever is also common in disease cases, but only slightly more so than in non-disease cases – 9.6% vs. 8.0%).
Figure MC2.1.3: A summary view of cases in which “days til death” = 8, viewed alongside a histogram of symptoms for those cases.
The following list provides a breakdown of the top symptoms for each subset of cases:
All hospitalizations:
Vomiting 10.5%
Fever 8.3%
Abdominal Pain 8.1%
Diarrhea 4.3%
Back Pain 4.1%
Headache 3.0%
Rash 2.7%
Blurred Vision 1.7%
Cough 1.7%
Swelling 1.6%
Nose Bleed 1.2%
Hospitalizations resulting in death from the disease (death after 8 days):
Vomiting 29.6%
Abdominal Pain 21.4%
Diarrhea 12.4%
Back Pain 10.3%
Fever 9.6%
Swelling 3.4%
Nose Bleed 3.4%
Neck Pain 1.4%
Headache 1.4%
Blurred Vision 1.4%
Tremors 0.7%
Hearing Loss 0.7%
Abnormal Labs 0.7%
Nausea 0.7%
Proteinuria 0.7%
Leg Pain 0.7%
Rash 0.7%
Conjunctivitis 0.7%
All other hospitalizations resulting in death:
Fever 8.0%
Vomiting 4.9%
Abdominal Pain 4.1%
Headache 3.6%
Rash 3.3%
Cough 2.4%
Back Pain 2.2%
Diarrhea 2.1%
Blurred Vision 1.8%
Chest Pain 1.2%
Dizziness 1.1%
Nausea 1.1%
Weakness 1.1%
Swelling 1.1%
Shortness Of Breath 1.1%
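A per-subset breakdown like the one above can be approximated by splitting records on the “days til death” property and computing each symptom's share within each subset. The following is a rough sketch, assuming one parsed symptom per record and a value of None for patients who did not die; the record layout, subset keys, and sample records are hypothetical.

```python
from collections import Counter


def symptom_prevalence(records):
    """records: iterable of (symptom, days_til_death or None).
    Returns {subset: {symptom: share of that subset}}."""
    counts = {
        "all": Counter(),
        "disease_death": Counter(),  # death exactly 8 days after admission
        "other_death": Counter(),    # any other death
    }
    for symptom, days in records:
        counts["all"][symptom] += 1
        if days is not None:
            key = "disease_death" if days == 8 else "other_death"
            counts[key][symptom] += 1
    return {
        subset: {s: n / sum(c.values()) for s, n in c.most_common()}
        for subset, c in counts.items()
    }


# Invented records for illustration only.
records = [
    ("Vomiting", 8), ("Vomiting", 8), ("Fever", 8), ("Abdominal Pain", 8),
    ("Fever", 3), ("Cough", None), ("Fever", None), ("Vomiting", None),
]
prevalence = symptom_prevalence(records)
print(prevalence["disease_death"])
```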
To determine which of these symptoms are most likely to be fatal, we generated a “Pocket Histogram” in order to view property values positively correlated with death. Among total hospitalizations, a number of symptoms show a positive correlation with death, most strongly Tremors:
Figure MC2.1.4: Pocket Histogram showing correlation of various symptoms with death among total hospitalizations.
We noted the following correlations among both total hospitalizations and those resulting in death after 8 days:
· Symptoms positively associated with death (within total hospitalizations):
symptoms='Tremors' / 4.22 / 3901
symptoms='Hearing Loss' / 4.06 / 3755
symptoms='Abnormal Labs' / 4.06 / 3753
symptoms='Proteinuria' / 4.01 / 3707
symptoms='Conjunctivitis' / 3.93 / 3652
symptoms='Vomiting' / 3.59 / 93214
symptoms='Nose Bleed' / 3.32 / 18453
symptoms='Diarrhea' / 3.24 / 3771
symptoms='Abdominal Pain' / 3.18 / 112013
symptoms='Back Pain' / 2.91 / 48568
symptoms='Swelling' / 2.01 / 11126
symptoms='Pregnant' / 1.34 / 213
symptoms='Fever' / 1.29 / 41781
symptoms='Vaginal Problems' / 1.18 / 645
· Symptoms positively associated with 8-day deaths (i.e. deaths from the disease):
symptoms='Hearing Loss' / 1.03 / 3750
symptoms='Abnormal Labs' / 1.03 / 3747
symptoms='Proteinuria' / 1.03 / 3701
symptoms='Tremors' / 1.03 / 3891
symptoms='Vomiting' / 1.03 / 92899
symptoms='Conjunctivitis' / 1.03 / 3639
symptoms='Abdominal Pain' / 1.03 / 111502
symptoms='Diarrhea' / 1.03 / 3751
symptoms='Nose Bleed' / 1.03 / 18351
symptoms='Back Pain' / 1.03 / 48289
symptoms='Pregnant' / 1.02 / 211
symptoms='Vaginal Problems' / 1.02 / 638
symptoms='Swelling' / 1.02 / 10999
symptoms='Fever' / 1.01 / 40806
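The statistic behind the Pocket Histogram is not described here, and the columns in the lists above are unlabeled. One common way to express a positive association between a property value and an outcome is a lift ratio, P(death | symptom) / P(death), where values above 1 indicate a positive association. The sketch below computes that ratio; whether it matches Palantir's calculation is an assumption, and the record layout, function name, and sample data are hypothetical.

```python
from collections import defaultdict


def lift_by_symptom(records):
    """records: iterable of (symptom, died) pairs, died being True/False.
    Returns {symptom: P(death | symptom) / P(death)}.
    Assumes at least one death overall; otherwise the base rate is zero."""
    total = 0
    deaths = 0
    per_symptom = defaultdict(lambda: [0, 0])  # symptom -> [cases, deaths]
    for symptom, died in records:
        total += 1
        deaths += died
        per_symptom[symptom][0] += 1
        per_symptom[symptom][1] += died
    base_rate = deaths / total
    return {s: (d / n) / base_rate for s, (n, d) in per_symptom.items()}


# Invented sample: 4 Tremors cases (2 deaths), 16 Cough cases (1 death).
records = (
    [("Tremors", True)] * 2 + [("Tremors", False)] * 2
    + [("Cough", True)] * 1 + [("Cough", False)] * 15
)
lift = lift_by_symptom(records)
print(sorted(lift.items(), key=lambda item: item[1], reverse=True))
```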
Additionally, the pocket histogram for total hospitalizations indicates that chances of death are fairly similar for each city with the exceptions of Nonthaburi, Thailand and Mersin, Turkey, both of which have extremely low death rates. This may indicate that the disease did not spread to Thailand or Turkey:
Figure MC2.1.5: Correlation of various locations with death among overall hospitalizations.
Mortality Rates:
2.5% of all hospitalizations resulted in death, and 2.2% of all hospitalizations resulted in a death that appeared to be related to the disease. Mortality rates varied slightly between locations (rates in Turkey and Thailand were extremely low, as those locations were most likely unaffected by the disease).
MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.
Timing of Outbreak:
In order to create a statistical reference point, we generated the following table using figures derived from time histograms of death dates (both overall and disease-related) for each location.
| Location | First death | First seemingly disease-related death | Initial small jump in deaths | Big jump in deaths | Peak | Last death |
| Nairobi | 4/27/2009 | 4/27/2009 | 5/2/2009 | 5/4/2009 | 5/22/2009 | 6/24/2009 |
| Aleppo | 4/28/2009 | 4/28/2009 | 5/4/2009 | 5/5/2009 | 5/23/2009 | 6/30/2009 |
| Yemen | 4/29/2009 | 4/29/2009 | 5/4/2009 | 5/6/2009 | 5/24/2009 | 6/30/2009 |
| Lebanon | 4/29/2009 | 4/29/2009 | 5/4/2009 | 5/6/2009 | 5/25/2009 | 6/26/2009 |
| Karachi | 4/30/2009 | 4/30/2009 | 5/6/2009 | 5/6/2009 | 5/25/2009 | 6/29/2009 |
| Saudi Arabia | 4/24/2009 | 5/2/2009 | 5/5/2009 | 5/7/2009 | 5/25/2009 | 6/28/2009 |
| Venezuela | 5/1/2009 | 5/1/2009 | 5/5/2009 | 5/8/2009 | 5/27/2009 | 6/28/2009 |
| Iran | 5/2/2009 | 5/2/2009 | 5/6/2009 | 5/8/2009 | 5/27/2009 | 6/29/2009 |
| Colombia | 5/2/2009 | 5/2/2009 | 5/6/2009 | 5/9/2009 | 5/28/2009 | 6/30/2009 |
The first seemingly disease-related death in each location fell within the period from 4/27/2009 to 5/2/2009. In most cases, the first death and the first seemingly disease-related death occurred on the same date. However, in the case of Saudi Arabia, the first disease-related death occurred 8 days after the first death overall (4/24/2009). An initial small jump in deaths occurred 3-6 days after the first disease-related death in each location. A bigger jump occurred anywhere from 0-3 days after the initial jump (both jumps occurred on the same day in the case of Karachi only). For all locations, deaths peaked approximately 25 days after the first disease-related death, then subsided over the next month or so.
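The intervals discussed above can be recomputed directly from the dates in the table. The sketch below does so in Python for three of the nine locations (chosen only for brevity); the dictionary layout and function name are hypothetical, while the dates are copied from the table.

```python
from datetime import date

# (first disease-related death, initial small jump, big jump, peak)
milestones = {
    "Nairobi":  (date(2009, 4, 27), date(2009, 5, 2), date(2009, 5, 4), date(2009, 5, 22)),
    "Karachi":  (date(2009, 4, 30), date(2009, 5, 6), date(2009, 5, 6), date(2009, 5, 25)),
    "Colombia": (date(2009, 5, 2),  date(2009, 5, 6), date(2009, 5, 9), date(2009, 5, 28)),
}


def intervals(first, small_jump, big_jump, peak):
    """Day counts between successive milestones for one location."""
    return {
        "first_to_small_jump": (small_jump - first).days,
        "small_to_big_jump": (big_jump - small_jump).days,
        "first_to_peak": (peak - first).days,
    }


for location, dates in milestones.items():
    print(location, intervals(*dates))
```

Adding the remaining rows of the table to `milestones` extends the comparison to every location.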
Infections By Location: Relative and Total
In order to visualize the relative geographic concentration of disease-related deaths, we generated a heatmap for all locations. Note that Turkey and Thailand are deep blue, indicating few deaths.
Figure MC2.2.1: Heatmap of disease-related deaths by outbreak location.
We can also view total disease-related deaths per location using a property value histogram based on the “location” property for each death:
Figure MC2.2.2: Total deaths from disease by location.
Recovery Ability:
As noted above, in each location, deaths peaked approximately 25 days after the first disease-related death, then began dropping, finally tapering off approximately a month after the peak. This can be seen in the table above and is also illustrated in this scattergram of daily deaths by location:
Figure MC2.2.3: Scattergram of deaths. Y = location, X = day number, scale = percentile.
In general, the timelines of the disease for each location are extremely similar, whether we look at the date of hospitalization or the date of death. The only major difference is the date on which the disease first appeared in each location. Below is a figure comparing the date-of-death timelines for Karachi, Pakistan (165,605 total deaths) and Aden, Yemen (7,711 total deaths):
Figure MC2.2.4: Time histograms showing dates of death from disease for Karachi, Pakistan, and Aden, Yemen.
Anomalies:
The most obvious anomaly was the fact that Turkey and Thailand were seemingly unaffected by the disease; however, we lacked sufficient data to form a hypothesis as to why this would be the case.
Another notable anomaly concerned the ages of those hospitalized. Time histograms for hospitalizations resulting in death, hospitalizations resulting in death after 8 days, and all other deaths each formed an almost-perfect bell curve, with a peak age between 43 and 45. This is an anomaly, since the 0-40 age range usually forms the bulk of most populations but here comprises a relatively small percentage of hospital visits. This seems to indicate that people younger than 43-45 are less susceptible to the disease, and that the closer the patient’s age is to zero, the less susceptible he/she is:
Figure MC2.2.5: Time histogram showing the age distribution of patients thought to have died of the disease (8 days after being hospitalized).